Creating Multilingual Translation Lexicons with Regional Variations Using Web Corpora
Abstract
This paper aims to automatically create multilingual translation lexicons with regional variations. We propose a transitive translation approach that determines translation variations across languages lacking sufficient corpora for translation, by mining bilingual search-result pages and geographic clues obtained from Web search engines. Experimental results show that the proposed approach can efficiently generate translation equivalents for various terms not covered by general translation dictionaries. They also reveal that the created lexicons reflect different cultural aspects across regions such as Taiwan, Hong Kong and mainland China.
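The abstract does not spell out how the transitive step is scored. One common reading of transitive translation is pivoting: a source term is translated into an intermediate language, and each intermediate candidate is then translated into the target language. The following is a minimal sketch under that assumption, not the authors' method. The function name `transitive_scores`, the sum-of-products scoring, and all terms and numbers are hypothetical placeholders rather than data from the paper.

```python
from collections import defaultdict


def transitive_scores(src_to_pivot, pivot_to_tgt):
    """Rank target candidates by summing score products over pivot terms.

    src_to_pivot: {pivot_term: score} for one source term (assumed input).
    pivot_to_tgt: {pivot_term: {target_term: score}} (assumed input).
    """
    scores = defaultdict(float)
    for pivot, s_sp in src_to_pivot.items():
        # A pivot term with no known target translations contributes nothing.
        for tgt, s_pt in pivot_to_tgt.get(pivot, {}).items():
            scores[tgt] += s_sp * s_pt
    # Highest combined score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy scores, purely illustrative.
src_to_pivot = {"pivot_a": 0.75, "pivot_b": 0.25}
pivot_to_tgt = {
    "pivot_a": {"target_x": 0.5, "target_y": 0.5},
    "pivot_b": {"target_y": 1.0},
}

ranked = transitive_scores(src_to_pivot, pivot_to_tgt)
print(ranked)
```

In this toy setup `target_y` ranks first because it is reachable through both pivot terms. Regional variation could be modeled by keeping a separate `pivot_to_tgt` table per region, though that too is an assumption about how the geographic clues would be used.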
Similar resources
Standards & best practice for multilingual computational lexicons: ISLE MILE and more
ISLE (International Standards for Language Engineering) is a transatlantic standards oriented initiative under the Human Language Technology (HLT) programme within the EU-US International Research Co-operation. It is a continuation of the European EAGLES (Expert Advisory Group for Language Engineering Standards) initiative, carried out through a number of subsequent projects funded by the Europ...
Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora
While parallel corpora are an indispensable resource for data-driven multilingual natural language processing tasks such as machine translation, they are limited in quantity, quality and coverage. As a result, learning translation models from nonparallel corpora has become increasingly important nowadays, especially for low-resource languages. In this work, we propose a joint model for iterativ...
Lexicon+TX: rapid construction of a multilingual lexicon with under-resourced languages
Most efforts at automatically creating multilingual lexicons require input lexical resources with rich content (e.g. semantic networks, domain codes, semantic categories) or large corpora. Such material is often unavailable and difficult to construct for under-resourced languages. In some cases, particularly for some ethnic languages, even unannotated corpora are still in the process of collect...
MTriage: Web-enabled Software for the Creation, Machine Translation, and Annotation of Smart Documents
Progress in the Machine Translation (MT) research community, particularly for statistical approaches, is intensely data-driven. Acquiring source language documents for testing, creating training datasets for customized MT lexicons, and building parallel corpora for MT evaluation require translators and non-native speaking analysts to handle large document collections. These collections are furt...
A Cheap and Fast Way to Build Useful Translation Lexicons
The paper presents a statistical approach to automatic building of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation for the two algorithms is presented in some detail in terms of precision, recall and processing time. We conclude by briefly presenting some of our applications of the mu...